Travel is an important activity that people in the United Kingdom (UK) participate in for leisure, business and visiting family. Choices are influenced by a number of personal and wider influences (such as the economic or political).
Individual travel choices can also have implications for the economy and environment. This makes understanding trends important. Additionally in the past 10 years the UK has seen a number of events which theoretically could have influenced these travel choices (e.g. Coronavirus pandemic, EU Exit).
Overall have people in the UK changed their chosen travel destinations over time?
The data was provided by the UK’s Office for National Statistics (ONS) and accessed in May 2023 at their Travel Trends Dataset Pages. The only available data was between 2009 and 2021. The specific files used were:
I have used the “Number of visits to specified countries: by main country visited and nationality” as this allowed a more nuanced view of travel than region.
The data used in this project provides estimated visits based on the International Passenger Survey (IPS). The IPS uses interviews at all major airports, sea routes and Eurostar/ Eurotunnel terminals with approximately 250,000 people included in the estimates (although up to 800,000 interviews take place). There are no other details on the sampling method. Countries are consistent for every year available.
However the data for 2021 is more limited than previous years of the IPS due to a smaller sample. Reasons for this included the smaller number of travelers and issues collecting the data at key terminals (e.g. Eurotunnel). Additionally a new method for calculating the 2021 estimates was introduced. Full details of both the changes and data distruption are in the raw data files.
Overall the scale and coverage of the IPS make it reliable for our research question, bearing in mind caveats around 2021. After reviewing the overlapping years within the data sets the number of visits matched; making a merge still appropriate. Finally other data sources of a better or similar quality were not available.
The initial data for this project was relatively clean and within a workbook of multiple statistics. The table below shows the first few rows of how the initial data for 2017 to 2021.
data1721 = (here("raw_data","section3ukresidentsvisitsabroad2017to2021.xlsx"))
#data for 2017 to 2021
raw<- read_excel(data1721,
sheet="3.06",
skip=10,
n_max=68)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...8`
head(raw)
A full codebook describing the variables from the original data set and any calculated values are available in the github pages for this project.
The primary variables taken from the initial data are in the table below.
| variable | description | other_notes |
|---|---|---|
| country (1 in the raw data) | main destination travelled to from the UK | Source ONS raw data apart from the dfColour |
| 2009…2021 (each years individual variable) | number of visits in thousands (000s) | This includes business, leasure and family travel. In dfv2 years correspond to ranks |
This section covers the different processes that this data has been through prior to visualisation.
The following packages were used in this project:
renv::restore()
#Libraries
library(here)#Here for loading in the data
library(tidyverse)#Tidyverse for data management
library(ggplot2) #chart making
library(readxl)#Reading in Excel sheets as that is how the data is stored
library(plotly)#Making scroll over charts
library(htmlwidgets)#To save interactive chart
library(knitr) #To make a table for the RMarkdown
To support future replication this R Project used RENV (version:0.17.3) to keep packages consistent and details together. An additional document with package citations is available in the following file location.
As an initial step a set of values was created to support the importation and data cleaning processes. In theory these will allow for new data sets to be added in a simple way.
# 2. Load the data in
#Variables required for data cleaning of the ONS files
ONSList <- 68 #The length of the ONS data table, these are consistent
yearsdf1721 <- c("2017","2018","2019",
"2020","2021") # These are the years that we will want to take from this data frame
yearsdf0919 <- c("2009","2010","2011",
"2012","2013","2014",
"2015","2016") #Years wanted, always choose most recent
summaries <- list('Total World',
'Other Countries',"Europe",
'- of which EU',
'- of which EU Oth',
'- of which EU15',
"North America") #summary values used in the dataset
# Raw data locations
data1721 = (here("raw_data","section3ukresidentsvisitsabroad2017to2021.xlsx")) #data for 2017 to 2021
data0919 = (here ("raw_data","section3ukresidentsvisitsabroad2009to2019.xlsx")) #data for 2009 to 2017
The following code imports the data into R. It then renames the columns to match those in the original ONS data file and removes the rows that are blank.
# Load in latest dataset 2017-2021
df1721<- read_excel(data1721,
sheet="3.06",
skip=10,
n_max=ONSList)
##rename columns 2017-2021
df1721_n <- tibble(x = 1:9, y = c("country", "coltoremove", "2017","2018",
"2019","2020","2021","blankroremove",
"avgrowth1519"))#list of column names
names(df1721) <- df1721_n %>% select(y) %>% pull()#function to change the names
##Remove NAs in rows and select years only 2017 to 2021
df1721 <- df1721 %>%
select("country",all_of(yearsdf1721)) %>% #Select correct years
drop_na("country")# remove blank rows
## Read the excel file for 2009 to 2019
df0919 <- read_excel(data0919,
sheet="3.10",
skip=10,
n_max=ONSList)
###rename columns 2009 - 2019
df0919_n <- tibble(x=1:19, y= c("country",
"coltoremove",
"2009","2010","2011",
"2012","2013","2014",
"2015","2016","2017",
"2018","2019",
"blank1",
"change1819",
"blank2",
"growth1819",
"blank3",
"Avgrowth1519"))#list of column names
names(df0919) <- df0919_n %>% select(y) %>% pull() #function to change the names
###Remove NAs in rows and select years only to 2017
df0919 <- df0919 %>%
select("country",all_of(yearsdf0919)) %>% #Select correct years
drop_na("country")# remove blank rows
Once both data sets were in the same shape they were them into a single data frame that has all available years. This was then when I removed the summary variables to avoid uneccessary code.
#Merge the datasets
df0921 <- left_join(df0919, df1721,by="country")
##Remove summaries in the dataset
df0921 <-df0921 %>%
filter((!country %in% summaries))
As 58 countries is tricky to visualise I knew a rank variable would be required to develop the visualisation. I decided on 2021 to be the key year for this cut off because it is the most recent and presents a “now” vs the “past” view which is more intuitive. The ranks work on 1 being the most visited country.
## Add ranking variable for 2021 (this will be used to decide on the cases kept)
df0921 <- df0921 %>%
mutate(rank_cut = rank(-`2021`, ties.method = "average"))
#save combined data set
comData_n = paste(here("created_data"), "/travel0921.csv", sep = "")
write.csv(df0921, comData_n) #Write data to CSV
#check data
head(df0921)
I decided that there were three approaches to visualising the data:
Visits and ranks were seen as the most promising for visualisations as proportions would require more of the countries to be included in the plot. It was also assumed that an audience would be more interested in the most visited rather than proportions.
Other considerations:
With visualisation 1 I wanted to use the raw number of visits to show the change in travel. I decided on a dumbell plot because I felt that it showed the differences between both 2009 and 2021 and the countries well. It is also less cluttered than a traditional clustered bar chart and allowed for easier comparisons between countries.
Initially I created variables to select the years and the number of countries I wanted included. This was to allow for any quick changes based on how the final graph was looking. After this the data was then moved into a long format with the year column name becoming the year variable. The number of visits for each year is included as visits.
#3.1. Create variables to allow us to choose the number of 2021 ranks
cutrank1 <- 20 #number of countries that we want included in the chart
yrs_inc1 <- c(2009, 2021) # years that we want to include (can be any year from 2009-2021 apart from 2020)
#3.2 Create summary DF based on the number of countries
dfv1 <- df0921 %>%
filter (rank_cut<=cutrank1)
#3.3. Convert to long format
longdfv1 <- dfv1 %>% select("country",
"2009", "2010",
"2011", "2012",
"2013", "2014",
"2015", "2016",
"2017", "2018",
"2019", "2021") %>% #2020 dropped due to no data
pivot_longer(!country,
names_to = "year",
names_transform = list(year = as.factor), #Changed to factor to allow for it to be treated as a category
values_to = "visits") %>%
arrange(desc(visits)) %>% #order by the most visited to the least
filter(year %in% yrs_inc1) #removes years that we are not interested in as defined above
head(longdfv1)
To create the dumbell plot I added the visits data to geom_point and then added a line to connect the 2 dates together. As some data points overlap I ensured that these points were opaque. Additionally I created a dynamic subtitle in case the years chosen changes in the future (e.g. if we decide that the 2019 pre-covid year is a more appropriate comparison). I kept the colours as default as I felt that they actually worked quite well. The countries were reordered by the largest visits but this had mixed results.
#3.4. Create a dumbbell plot
#3.4.1. Create a subtitle that changes with the data selected.
subtitle1 <- paste("Showing the top", toString(cutrank1), "most visited countries in 2021")
#3.4.2. Match the aesthetics (basic plot)
p1 <- ggplot(longdfv1,
aes(x = visits, y = reorder(country, visits)))
#3.4.3. Add layers
v1 <- p1 + #lines to show the years are connected
geom_line(color="grey") +
#shows the year
geom_point(aes(color = year), size = 3, alpha=0.8) +
#Legend at the bottom for ease
theme(legend.position = "bottom") +
#Labels for the chart
labs(x = "Number of visits from the UK (000s)",
y = "Country",
title = "Number of visits abroad from the UK in 2009 and 2021 by main country visited",
subtitle = subtitle1,
caption= "Source: Office for National Statistics International Passenger Survey")
#3.5. Save the dumbbell plot
ggsave(here("plots", "dumbell.png"), v1, width = 11.2, height = 7.163)
As I felt that this plot made reading the actual number difficult I entered it into ggplotly to allow for scrolling over points. This was useful but did make some of the formatting less clear.
#3.6. Add the dumbbell to plotly to make it interactive
ip1 <- ggplotly(v1) #Use plotly for scrollover
#Display interactive visualisation 1
ip1
#3.7. Save the interactive plot
saveWidget(ip1, file='interactivedumbell.html')
#Image for PDF markdown (add if required)
#screen <- here("documents", "printscreen_ip1.png") #location of file
#knitr::include_graphics(screen)
From this visualisation we can see that most countries have seen a reduction in the number of visits from 2009 to 2021. This is especially true for Spain and France but they had a much higher number of visitors in 2009. Only Romania has a clear increase in visits. Although this is useful I don’t think it shows the changes in travel patterns in an intuitive way. Additionally it mostly confirms that travel during the end of the COVID-19 pandemic was lower which is not particularly insightful.
After reviewing plot 1 I decided that ranks would show the same information in a simpler way and provide a way to understand trends despite the impact of COVID-19 on traveler numbers. The trade off was losing the scale of difference between Spain and the rest of the countries.
I decided on a slope chart as this provides the most information without it being too cluttered. I was also conscious that other alternatives such as a line graph or paired bar chart would be difficult to interpret.
As with visualisation 1 I made variables to allow for easy customisation of the number of countries/ years. Prior to visualising I needed to convert the data from actual visits to the rank position for that year. Following on from that I needed to select the number of countries that I wanted included and then convert to the long format. Finally I selected only cases in 2009 and 2021.
#4.1. Create variable to allow for filtering the data
cutrank2 <- 15 # Create variables to allow us to choose the number of 2021 ranks
yrs_inc2 <- c(2009, 2021) #Create variables for the years to include in visualisation
#4.2. Create new DF of just ranks
dfv2 <- df0921 %>%
mutate(r09 = rank(-`2009`, ties.method = "average"),
r10 = rank(-`2010`, ties.method = "average"),
r11 = rank(-`2011`, ties.method = "average"),
r12 = rank(-`2012`, ties.method = "average"),
r13 = rank(-`2013`, ties.method = "average"),
r14 = rank(-`2014`, ties.method = "average"),
r15 = rank(-`2015`, ties.method = "average"),
r16 = rank(-`2016`, ties.method = "average"),
r17 = rank(-`2017`, ties.method = "average"),
r18 = rank(-`2018`, ties.method = "average"),
r19 = rank(-`2019`, ties.method = "average"),
r20 = r19, #no data so saying that it would have been steady from 19
r21 = rank(-`2021`, ties.method = "average")) %>% #r09-#r21 are the rank scores for each year
select("country",
"r09", "r10", "r11",
"r12", "r13", "r14",
"r15", "r16", "r17",
"r18", "r19", "r21",
'rank_cut') %>% #selecting the ranked data but 2020 dropped due to no data
rename('2009'='r09',
'2010'='r10',
'2011'='r11',
'2012'='r12',
'2013'='r13',
'2014'='r14',
'2015'='r15',
'2016'='r16',
'2017'='r17',
'2018'='r18',
'2019'='r19',
'2021'='r21',
'rank_cut2'='rank_cut') #changing rank names to years to allow for changing to long format
#4.3. Create summary DF based on the number of countries
dfv2 <- dfv2 %>%
filter (rank_cut2 <=cutrank2)
#4.4. Change to Long format
longdfv2 <- dfv2 %>%
select("country",
"2009", "2010",
"2011", "2012",
"2013", "2014",
"2015", "2016",
"2017", "2018",
"2019", "2021") %>%
pivot_longer(!country,
names_to = "year",
names_transform = list(year = as.factor), #Changed to factor to allow for it to be treated as a category
values_to = "rank") %>%
filter(year %in% yrs_inc2) #remove years that are irrelevant
head(longdfv2)
An additional factor in the choice of country is the geographic region that it is in. To account for this I created a colour scheme where each country grouping (based on the ONS National statistics country classification). To develop this scheme each grouping got their own main colour with each individual country showing a different shade. The table below shows the data format.
#4.5.1. Retrieve data file containing the colours for each country/ continent
dataColour <- here("created_data", "colours.csv") #location of file
dfColour <- read.csv(file = dataColour) # read in data
head(dfColour)
This was then combined with the long rank data.
#4.5.2. Match the colour file to the data by country
longdfv2 <- left_join(longdfv2, dfColour,by="country")
head(longdfv2)
To create the final visualisation I used a line chart added to points at the end of each line to highlight the colour and position changes. To support understanding which countries are in which order I added the country name and rank next to the chart.
Other details:
#4.6. Create the final visualisation (slope plot)
#4.6.1. Create titles for the plot
subtitle2 <- paste("Changes in the rank for overall visits (business and travel) for the top",
toString(cutrank2),
" most visited countries from the UK between"
,toString(first(yrs_inc2)),
"to" ,
toString(last(yrs_inc2))) # Label explaining the data
Source2 <- bquote(paste(~bold('Source:'),"ONS International Passenger Survey",
~bold('Note:'),"Colours indicate country grouping; 2021 had disrupted data collection")) #data source and interpretation notes
title2 <- paste("Viva Espania! Spain remains the UK's most visited country")
#4.6.2. Match the aesthetics (basic plot)
p2 <- longdfv2 %>%
ggplot(aes(x=year, y=rank,group=country, colour= country))
#4.6.3. Add the formatting
final <- p2+
#slope lines
geom_line(alpha = 0.5, linewidth=1.5) +
#End points for the data
geom_point(size = 3.5) +
#Labels to show the country and rank next to each of the countries
geom_text(data= longdfv2 %>% filter(year==(toString(first(yrs_inc2)))), #selects the first year
aes(x=year, y=rank,label = country), #add the country name
size = 4, nudge_x = -0.35, fontface = "bold", color = "#000000") + # format so next to text
geom_text(data= longdfv2 %>% filter(year==(toString(last(yrs_inc2)))), #selects the last year
aes(x=year, y=rank,label = country), # add the country name
size = 4, nudge_x = 0.35, fontface = "bold", color = "#000000") + #format so next to text
geom_text(data= longdfv2 %>% filter(year ==(toString(first(yrs_inc2)))), #select the first year
aes(x=year, y=rank,label = rank), #add the rank position
size = 4, nudge_x = -0.05, fontface = "bold", color = "#000000") + #place next to country name
geom_text(data= longdfv2 %>% filter(year ==(toString(last(yrs_inc2)))), #select the last year
aes(x=year, y=rank,label = rank), # add the rank to the chart
size = 4, nudge_x = 0.05,fontface = "bold", color = "#000000") + #formatting
#Reverse the scale so that the lowest ranked (most visited) is at the top
scale_y_reverse( breaks=NULL) +
#Use custom colours
scale_color_manual(values = setNames(longdfv2$hex_value, longdfv2$country)) +
#Add labels
labs(x = "Year", #X-axis
y = "Rank visits from the UK (1 is the most visited)", #y-axis
title = title2, #main title
subtitle = str_wrap(subtitle2, width=150), #subtitle
caption= Source2) + #caption with each on a new line
#Specify the theme (fonts, background etc.)
theme(plot.title = element_text(size=18, face = "bold"), #Main title font
axis.text.x = element_text(size=14, face="bold", colour="black"), #X-axis element font
axis.text.y = element_text(size=12, face="bold"), #y-axis years font
axis.title = element_text(size=14,face = "bold"), #Axis title fonts
panel.background = element_blank(), #Make the background blank
legend.position = "none",#Remove the legend
plot.caption = element_text(size = 10, hjust = 1)) #Caption font
#4.6.4. Save the final plot
ggsave(here("plots", "finalplot.png"), final, width = 13.44, height = 8.595)
The slope chart of ranked data is much clearer to interpret because the steepness of the slopes shows the trend. Overall there has been a slight change in travel patterns but this is not drastic or shifting the areas of the World that people are visiting.
However more specific insights include:
The main reflection that I have on the visualisations is the difficulties in trying to provide all the information available in a simple format. Although it can seem inappropriate to lose data on actual visits and the scale of the Spanish visits this was necessary to see the big picture.
With visualisation 1 I think that the chart type and overall look is good. However the difficulties managing the scale and getting the countries to order as planned made it not fully effective.
I think that visualisation 2 is sufficient to get an overall idea of the trends with UK travel. I like the colour scheme and that it looks simple whilst containing the necessary information. Likewise the subtitles that automatically change were another strength.
The final visualisation does have some limitations:
If there was more time and access to the full data set for the IPS since 1961 I would have liked to create an interactive dashboard of the ranked travel data. This would allow people to type in their countries of interest and see longer term trends. This could also allow people who are interested to click on the position to see the actual number of visits. An example of this is provided by the ONS on 100 years of baby names.
From the two visualisations the travel choices in the UK appear to have remained relatively consistent in the UK. However there do appear to be some changes that would hopefully become more clear in the future.
Additionally all the different free R resources online and stackoverflow have been useful in producing these charts.